Homework 1

Author

Chih-Chan (Jessica) Lan

Primary Question: whether daily concentrations of PM 2.5 decreased in California over the 20 years spanning from 2002 to 2022.

Questoin 1

(30 points) Given the formulated question from the assignment description, you will now conduct EDA Checklist items 1-5. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website, then read the data into R. For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check the distribution of the key variable we are analyzing (PM 2.5). Write up a summary of all of your findings.

Download and read the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
folder <- "/Users/cclan/Documents/2025.09 Fall/PM566/PM566labs/HW1"

df_2002 <- read.csv(file.path(folder, "ad_viz_plotval_data_2002.csv"))
df_2022 <- read.csv(file.path(folder, "ad_viz_plotval_data_2022.csv"))

Check the 2002 PM2.5 data

dim(df_2002)
[1] 15976    22
head(df_2002)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              81       Livermore               1              100
2              93       Livermore               1              100
3              74       Livermore               1              100
4              82       Livermore               1              100
5              98       Livermore               1              100
6             115       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         120
2              88101  PM2.5 - Local Conditions         120
3              88101  PM2.5 - Local Conditions         120
4              88101  PM2.5 - Local Conditions         120
5              88101  PM2.5 - Local Conditions         120
6              88101  PM2.5 - Local Conditions         120
                     Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(df_2002)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
15971 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
15972 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
15973 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
15974 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
15975 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
15976 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
15971              62 Woodland-Gibson Road               1              100
15972              62 Woodland-Gibson Road               1              100
15973               6 Woodland-Gibson Road               1              100
15974              77 Woodland-Gibson Road               1              100
15975              28 Woodland-Gibson Road               1              100
15976              33 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
15971              88101  PM2.5 - Local Conditions         117
15972              88101  PM2.5 - Local Conditions         117
15973              88101  PM2.5 - Local Conditions         117
15974              88101  PM2.5 - Local Conditions         117
15975              88101  PM2.5 - Local Conditions         117
15976              88101  PM2.5 - Local Conditions         117
                         Method.Description CBSA.Code
15971 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15972 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15973 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15974 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15975 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15976 R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                    CBSA.Name State.FIPS.Code      State
15971 Sacramento--Roseville--Arden-Arcade, CA               6 California
15972 Sacramento--Roseville--Arden-Arcade, CA               6 California
15973 Sacramento--Roseville--Arden-Arcade, CA               6 California
15974 Sacramento--Roseville--Arden-Arcade, CA               6 California
15975 Sacramento--Roseville--Arden-Arcade, CA               6 California
15976 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
15971              113   Yolo      38.66121      -121.7327
15972              113   Yolo      38.66121      -121.7327
15973              113   Yolo      38.66121      -121.7327
15974              113   Yolo      38.66121      -121.7327
15975              113   Yolo      38.66121      -121.7327
15976              113   Yolo      38.66121      -121.7327
summary(df_2002)
     Date              Source             Site.ID              POC       
 Length:15976       Length:15976       Min.   :60010007   Min.   :1.000  
 Class :character   Class :character   1st Qu.:60290014   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :60590007   Median :1.000  
                                       Mean   :60549600   Mean   :1.581  
                                       3rd Qu.:60731002   3rd Qu.:1.000  
                                       Max.   :61131003   Max.   :6.000  
                                                                         
 Daily.Mean.PM2.5.Concentration    Units           Daily.AQI.Value 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 39.00  
 Median : 12.00                 Mode  :character   Median : 56.00  
 Mean   : 16.12                                    Mean   : 59.28  
 3rd Qu.: 20.50                                    3rd Qu.: 72.00  
 Max.   :104.30                                    Max.   :185.00  
                                                                   
 Local.Site.Name    Daily.Obs.Count Percent.Complete AQS.Parameter.Code
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS.Parameter.Description  Method.Code  Method.Description   CBSA.Code    
 Length:15976              Min.   :117   Length:15976       Min.   :12540  
 Class :character          1st Qu.:120   Class :character   1st Qu.:23420  
 Mode  :character          Median :120   Mode  :character   Median :40140  
                           Mean   :297                      Mean   :33270  
                           3rd Qu.:707                      3rd Qu.:41740  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :929    
  CBSA.Name         State.FIPS.Code    State           County.FIPS.Code
 Length:15976       Min.   :6       Length:15976       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 59.00  
                    Mean   :6                          Mean   : 54.78  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site.Latitude   Site.Longitude  
 Length:15976       Min.   :32.63   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :35.36   Median :-119.1  
                    Mean   :36.00   Mean   :-119.4  
                    3rd Qu.:37.77   3rd Qu.:-117.9  
                    Max.   :41.71   Max.   :-115.5  
                                                    
hist(df_2002$Daily.Mean.PM2.5.Concentration)

Check the 2022 PM2.5 data

dim(df_2022)
[1] 59918    22
head(df_2022)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              58       Livermore               1              100
2              60       Livermore               1              100
3              39       Livermore               1              100
4              21       Livermore               1              100
5              23       Livermore               1              100
6              21       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         170
2              88101  PM2.5 - Local Conditions         170
3              88101  PM2.5 - Local Conditions         170
4              88101  PM2.5 - Local Conditions         170
5              88101  PM2.5 - Local Conditions         170
6              88101  PM2.5 - Local Conditions         170
                    Method.Description CBSA.Code
1 Met One BAM-1020 Mass Monitor w/VSCC     41860
2 Met One BAM-1020 Mass Monitor w/VSCC     41860
3 Met One BAM-1020 Mass Monitor w/VSCC     41860
4 Met One BAM-1020 Mass Monitor w/VSCC     41860
5 Met One BAM-1020 Mass Monitor w/VSCC     41860
6 Met One BAM-1020 Mass Monitor w/VSCC     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(df_2022)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
59913 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
59914 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
59915 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
59916 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
59917 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
59918 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
59913              19 Woodland-Gibson Road               1              100
59914              21 Woodland-Gibson Road               1              100
59915              33 Woodland-Gibson Road               1              100
59916              99 Woodland-Gibson Road               1              100
59917              77 Woodland-Gibson Road               1              100
59918               6 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
59913              88101  PM2.5 - Local Conditions         145
59914              88101  PM2.5 - Local Conditions         145
59915              88101  PM2.5 - Local Conditions         145
59916              88101  PM2.5 - Local Conditions         145
59917              88101  PM2.5 - Local Conditions         145
59918              88101  PM2.5 - Local Conditions         145
                                         Method.Description CBSA.Code
59913 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59914 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59915 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59916 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59917 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59918 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                    CBSA.Name State.FIPS.Code      State
59913 Sacramento--Roseville--Arden-Arcade, CA               6 California
59914 Sacramento--Roseville--Arden-Arcade, CA               6 California
59915 Sacramento--Roseville--Arden-Arcade, CA               6 California
59916 Sacramento--Roseville--Arden-Arcade, CA               6 California
59917 Sacramento--Roseville--Arden-Arcade, CA               6 California
59918 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
59913              113   Yolo      38.66121      -121.7327
59914              113   Yolo      38.66121      -121.7327
59915              113   Yolo      38.66121      -121.7327
59916              113   Yolo      38.66121      -121.7327
59917              113   Yolo      38.66121      -121.7327
59918              113   Yolo      38.66121      -121.7327
summary(df_2022)
     Date              Source             Site.ID              POC        
 Length:59918       Length:59918       Min.   :60010007   Min.   : 1.000  
 Class :character   Class :character   1st Qu.:60290019   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Median :60631006   Median : 3.000  
                                       Mean   :60564225   Mean   : 3.768  
                                       3rd Qu.:60731026   3rd Qu.: 3.000  
                                       Max.   :61131003   Max.   :24.000  
                                                                          
 Daily.Mean.PM2.5.Concentration    Units           Daily.AQI.Value 
 Min.   : -6.700                Length:59918       Min.   :  0.00  
 1st Qu.:  4.100                Class :character   1st Qu.: 23.00  
 Median :  6.800                Mode  :character   Median : 38.00  
 Mean   :  8.414                                   Mean   : 39.22  
 3rd Qu.: 10.700                                   3rd Qu.: 54.00  
 Max.   :302.500                                   Max.   :454.00  
                                                                   
 Local.Site.Name    Daily.Obs.Count Percent.Complete AQS.Parameter.Code
 Length:59918       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88192     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS.Parameter.Description  Method.Code  Method.Description   CBSA.Code    
 Length:59918              Min.   :143   Length:59918       Min.   :12540  
 Class :character          1st Qu.:170   Class :character   1st Qu.:31080  
 Mode  :character          Median :170   Mode  :character   Median :40140  
                           Mean   :338                      Mean   :34971  
                           3rd Qu.:707                      3rd Qu.:41860  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :4569   
  CBSA.Name         State.FIPS.Code    State           County.FIPS.Code
 Length:59918       Min.   :6       Length:59918       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 63.00  
                    Mean   :6                          Mean   : 56.28  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site.Latitude   Site.Longitude  
 Length:59918       Min.   :32.58   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :36.49   Median :-119.6  
                    Mean   :36.25   Mean   :-119.6  
                    3rd Qu.:37.96   3rd Qu.:-117.9  
                    Max.   :41.76   Max.   :-115.5  
                                                    
hist(df_2022$Daily.Mean.PM2.5.Concentration)

library(ggplot2)
library(patchwork)

ymax <- max(
  df_2002$Daily.Mean.PM2.5.Concentration,
  df_2022$Daily.Mean.PM2.5.Concentration
)

p1 <- df_2002 %>%
  ggplot(aes(x = 1, y = Daily.Mean.PM2.5.Concentration)) + 
  geom_violin(fill = "red", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  labs(title = "PM2.5 Distribution in 2002", 
       y = "Daily Mean PM2.5 Concentration", x = NULL) +
  coord_cartesian(ylim = c(0, ymax)) +
  theme_bw()

p2 <- df_2022 %>%
  ggplot(aes(x = 1, y = Daily.Mean.PM2.5.Concentration)) + 
  geom_violin(fill = "blue", alpha = 0.5) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  labs(title = "PM2.5 Distribution in 2022", 
       y = "Daily Mean PM2.5 Concentration", x = NULL) +
  coord_cartesian(ylim = c(0, ymax)) +
  theme_bw()

p1 + p2

Summary: The 2002 dataset contains 15,976 records and 22 variables, including 9 character variables and 13 numeric variables. Some values for Core-Based Statistical Areas (CBSA) are missing. The histogram of PM2.5 shows a strongly right-skewed distribution.

The 2022 dataset contains 59,918 records and 22 variables. Similar to 2002, there are missing values in the CBSA code. Interestingly, the histogram of PM2.5 also shows a strongly right-skewed distribution, but it additionally includes negative PM2.5 values. Upon reviewing the data source, this may be due to instrumental factors. It is also worth noting that there are some outliers in the 2022 PM2.5 concentration data.

Reference: https://pmc.ncbi.nlm.nih.gov/articles/PMC10497433/

Question 2

(10 points) Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.

df_2002$Year <- substr(df_2002$Date, 7, 10)
df_2022$Year <- substr(df_2022$Date, 7, 10)

df <- bind_rows(df_2002, df_2022)
df <- df %>%
  rename(PM2.5 = Daily.Mean.PM2.5.Concentration,
         Lon = Site.Longitude,
         Lat = Site.Latitude)

I changed the name of Daily.Mean.PM2.5.Concentration to PM2.5, Site.Longitude to Lon, and Site.Latitude to Lat.

Question 3

(20 points) Create a basic map (or maps) in leaflet showing the locations of the monitoring sites, using different colors for each year. Summarize the spatial distribution of the sites. Does this distribution change from 2002 to 2022?

library(leaflet)
library(htmltools)

map_2002 <- df %>%
  filter(Year == "2002") %>%
  leaflet() %>%
  addProviderTiles('OpenStreetMap') %>% 
  addCircles(lat=~Lat,lng=~Lon, opacity=1, fillOpacity=1, radius=100, color="red") %>%
  addControl("<b>Map of Monitoring Sites in 2002</b>", position = "topright")

map_2022 <- df %>%
  filter(Year == "2022") %>%
  leaflet() %>%
  addProviderTiles('OpenStreetMap') %>% 
  addCircles(lat=~Lat,lng=~Lon, opacity=1, fillOpacity=1, radius=100, color="blue") %>%
  addControl("<b>Map of Monitoring Sites in 2022</b>", position = "topright")

leaflet_grid <- 
  tagList(
    tags$table(width = "100%",
      tags$tr(
        tags$td(map_2002),
        tags$td(map_2022)
      )
    )
  )

leaflet_grid
pal <- colorFactor(palette = c("red", "blue"), domain = df$Year)

leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = df %>% filter(Year == 2002),
                   ~Lon, ~Lat, color = "red", group = "2002", popup = ~County,
                   opacity = 0.5, fillOpacity = 0.5) %>%
  addCircleMarkers(data = df %>% filter(Year == 2022),
                   ~Lon, ~Lat, color = "blue", group = "2022", popup = ~County,
                   opacity = 0.5, fillOpacity = 0.5) %>%
  addLegend("bottomright", pal = pal, values = df$Year, title = "Year") %>%
  addLayersControl(overlayGroups = c("2002", "2022"))
df %>%
  distinct(Year, Site.ID) %>%
  count(Year, name = "n_sites")
  Year n_sites
1 2002     103
2 2022     167

Summary: From the maps of monitoring station distribution, it was difficult to make a clear comparison between the two years. To address this, I overlaid the maps of 2002 and 2022, which allowed for a more direct comparison. The results show that there are more sites in 2022, with most concentrated in the southern part of California. A comparison of the number of monitoring sites also confirms this observation.

Question 4

(10 points) Check for any data issues such as missing or implausible values of PM in the combined dataset. Calculate the proportion of missing/implausible values for each year and report any temporal patterns you see in these observations.

df <- df %>%
  mutate(PM_cat = case_when(
    PM2.5 <=  50 ~ "Good",
    PM2.5 <= 100 ~ "Moderate",
    PM2.5 <= 150 ~ "Unhealthy for Sensitive Groups",
    PM2.5 <= 200 ~ "Unhealthy",
    PM2.5 <= 300 ~ "Very Unhealthy",
    TRUE         ~ "Hazardous"
  ))
pm_cat_pct <- df %>%
  filter(!is.na(PM2.5)) %>%
  count(Year, PM_cat) %>%
  group_by(Year) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  select(Year, PM_cat, n, pct)

pm_cat_pct_wide <- pm_cat_pct %>%
  mutate(pct = round(pct, 2)) %>%
  select(Year, PM_cat, pct) %>%
  pivot_wider(
    names_from = Year,
    values_from = pct,
    values_fill = 0
  )

pm_cat_pct_wide
# A tibble: 6 × 3
  PM_cat                         `2002` `2022`
  <chr>                           <dbl>  <dbl>
1 Good                            96.2   99.8 
2 Moderate                         3.76   0.17
3 Unhealthy for Sensitive Groups   0.01   0.04
4 Hazardous                        0      0   
5 Unhealthy                        0      0.01
6 Very Unhealthy                   0      0.01
summary_stats <- df %>%
  group_by(Year) %>%
  summarise(
    n_records   = n(),
    n_missing   = sum(is.na(PM2.5)),
    q1          = quantile(PM2.5, 0.25, na.rm = TRUE),
    median      = quantile(PM2.5, 0.5,  na.rm = TRUE),
    q3          = quantile(PM2.5, 0.75, na.rm = TRUE),
    mean        = mean(PM2.5, na.rm = TRUE),
    min         = min(PM2.5, na.rm = TRUE),
    max         = max(PM2.5, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  rowwise() %>%
  mutate(
    iqr         = q3 - q1,
    lower_bound = q1 - 1.5 * iqr,
    upper_bound = q3 + 1.5 * iqr,
    n_outliers  = sum(df$PM2.5[df$Year == Year] < lower_bound |
                      df$PM2.5[df$Year == Year] > upper_bound, na.rm = TRUE),
    pct_outlier = 100 * n_outliers / (n_records - n_missing)
  ) %>%
  ungroup()

summary_stats
# A tibble: 2 × 14
  Year  n_records n_missing    q1 median    q3  mean   min   max   iqr
  <chr>     <int>     <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2002      15976         0   7     12    20.5 16.1    0    104.  13.5
2 2022      59918         0   4.1    6.8  10.7  8.41  -6.7  302.   6.6
# ℹ 4 more variables: lower_bound <dbl>, upper_bound <dbl>, n_outliers <int>,
#   pct_outlier <dbl>

Summary: In the dataset, there were no missing values for PM2.5. For implausible values, I identified outliers using Tukey’s boxplot rule. The proportion of outliers was 6.8% in 2002 and 4.6% in 2022. It is also noteworthy that negative PM2.5 values appeared in 2022, which are likely due to instrument error or data recording issues.

Regarding temporal patterns, both the mean and median PM2.5 values decreased from 2002 to 2022, indicating an overall improvement in air quality over the past two decades. However, when classifying PM2.5 concentrations according to the U.S. Air Quality Index, 2022 shows a higher proportion of days categorized as unhealthy. This suggests that while average conditions have improved, extreme pollution events leading to very high PM2.5 concentrations may have become more frequent.

Reference: https://www.airnow.gov/aqi/aqi-basics/

Question 5

(30 points) Explore the main question of interest at three different levels of spatial resolution. Create data visualizations (e.g. boxplots, histograms, line plots, violin plots) and summary statistics that best suit each level of the analysis. Be sure to write up explanations of what you observe at each level.

Level 1: State

Examine the primary question for the entire state.

library(ggplot2)
library(patchwork)

df$PM2.5[df$PM2.5 < 0] <- 0
binwidth <- 4

xmin <- floor(min(df$PM2.5))
xmax <- ceiling(max(df$PM2.5))
xmax <- 100

mean_2002   <- mean(df$PM2.5[df$Year == "2002"], na.rm = TRUE)
median_2002   <- median(df$PM2.5[df$Year == "2002"], na.rm = TRUE)
mean_2022   <- mean(df$PM2.5[df$Year == "2022"], na.rm = TRUE)
median_2022   <- median(df$PM2.5[df$Year == "2022"], na.rm = TRUE)

p1 <- df %>%
  filter(Year == "2002") %>%
  ggplot(aes(x = PM2.5)) +
  geom_histogram(aes(y = after_stat(count/sum(count) * 100)), binwidth = binwidth,
                 fill = "red", alpha = 0.5, color = "red") +
  coord_cartesian(ylim = c(0, ymax)) +
  geom_vline(xintercept = mean_2002,   linetype = "dashed", color = "red") +
  geom_vline(xintercept = median_2002, linetype = "solid",  color = "red") +
  coord_cartesian(xlim = c(xmin, xmax)) +
  labs(
    title = "PM2.5 Histogram – 2002",
    x = "Daily Mean PM2.5 Concentration (µg/m³)",
    y = "Percentage (%)"
  ) +
  theme_bw()
Coordinate system already present. Adding new coordinate system, which will
replace the existing one.
p2 <- df %>%
  filter(Year == "2022") %>%
  ggplot(aes(x = PM2.5)) +
  geom_histogram(aes(y = after_stat(count/sum(count) * 100)), binwidth = binwidth,
                 fill = "blue", alpha = 0.5, color = "blue") +
  coord_cartesian(ylim = c(0, ymax)) +
  geom_vline(xintercept = mean_2022,   linetype = "dashed", color = "blue") +
  geom_vline(xintercept = median_2022, linetype = "solid",  color = "blue") +
  coord_cartesian(xlim = c(xmin, xmax)) +
  labs(
    title = "PM2.5 Histogram – 2022",
    x = "Daily Mean PM2.5 Concentration (µg/m³)",
    y = "Percentage (%)"
  ) +
  theme_bw()
Coordinate system already present. Adding new coordinate system, which will
replace the existing one.
p1 / p2

Summary: I have already presented the violin plot in Question 1 and provided summary statistics at the state level in Question 4. In this question, I used histograms to compare the daily PM2.5 concentrations between 2002 and 2022. The solid line represents the mean value and dash line is median.

Overall, at the state level, the mean daily PM2.5 concentration decreased in 2022 compared to 2002. When focusing on the distribution within the 0–100 µg/m³ range, the histograms further show that a larger proportion of days in 2022 fall into lower PM2.5 concentration levels, indicating an overall improvement in air quality.

Level 2: County

Examine the primary question for every county in California.

pm_county <- df %>%
  group_by(Year, County) %>%
  summarise(
    mean   = mean(PM2.5, na.rm = TRUE),
    median = median(PM2.5, na.rm = TRUE),
    .groups = "drop"
  )

pm_county_summary <- pm_county %>%
  pivot_wider(
    names_from  = Year,
    values_from = c(mean, median),
    names_glue  = "{Year}_{.value}") %>%
  mutate(diff_mean = `2022_mean` - `2002_mean`,
         diff_median = `2022_median` - `2002_median`)

pm_county_summary
# A tibble: 51 × 7
   County       `2002_mean` `2022_mean` `2002_median` `2022_median` diff_mean
   <chr>              <dbl>       <dbl>         <dbl>         <dbl>     <dbl>
 1 Alameda            14.3         8.20         10              7      -6.05 
 2 Butte              14.8         6.19         11.5            4.5    -8.57 
 3 Calaveras           9.9         6.04          8              5      -3.86 
 4 Colusa             11.7         7.61          9              6.7    -4.12 
 5 Contra Costa       15.1         8.24          9.5            7.2    -6.85 
 6 Del Norte           3.82        4.98          3.45           4.2     1.16 
 7 El Dorado           4.91        4.07          3.95           2.7    -0.847
 8 Fresno             19.9        10.2          13.0            7.7    -9.74 
 9 Humboldt            7.79        6.76          6.8            5.7    -1.04 
10 Imperial           12.7         9.55         11.1            8.3    -3.16 
# ℹ 41 more rows
# ℹ 1 more variable: diff_median <dbl>
pd <- position_dodge2(width = 0.5, reverse = TRUE, preserve = "single")

df %>%
  ggplot(aes(x = County, y = PM2.5, 
             color = factor(Year, levels = c("2002", "2022")), group = Year)) +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", position = pd,
               width = 0.3, na.rm = TRUE) +
  stat_summary(fun.data = mean_sdl, geom = "point", position = pd, size = 2,
               na.rm = TRUE
  ) +
  labs(
    title = "Median Daily PM2.5 Concentration by County (2002 vs 2022)",
    x = "County", y = "PM2.5", color = "Year"
  ) +
  scale_color_manual(
    values = c("2002" = "red", "2022" = "blue"),
    breaks = c("2002", "2022")
  ) +
  theme_bw() +
  coord_flip() +
  theme(axis.text.y = element_text(size = 12))

Summary: From the summary dataset of PM2.5 by county, we can observe that both the mean and median PM2.5 levels decreased from 2002 to 2022, indicating an improvement in air quality. From the plot, however, a few counties show an increase in median PM2.5, including Trinity, Siskiyou, Nevada, Mono, Mendocino, and Del Norte. In addition, some counties only have data available for one of the two years. Overall, the findings suggest a general improvement in statewide air quality, while highlighting localized areas where PM2.5 levels have not improved or have even worsened.

Level 3: City

Restrict the data to sites in Los Angeles county and examine the primary question for every site.

pm_city <- df %>%
  filter(County == "Los Angeles" & !is.na(Local.Site.Name) & Local.Site.Name != "") %>%
  group_by(Year, Local.Site.Name) %>%
  summarise(
    mean   = mean(PM2.5, na.rm = TRUE),
    median = median(PM2.5, na.rm = TRUE),
    .groups = "drop"
  )

pm_city_summary <- pm_city %>%
  pivot_wider(
    names_from  = Year,
    values_from = c(mean, median),
    names_glue  = "{Year}_{.value}") %>%
  mutate(diff_mean = `2022_mean` - `2002_mean`,
         diff_median = `2022_median` - `2002_median`)

pm_city_summary
# A tibble: 17 × 7
   Local.Site.Name `2002_mean` `2022_mean` `2002_median` `2022_median` diff_mean
   <chr>                 <dbl>       <dbl>         <dbl>         <dbl>     <dbl>
 1 Azusa                 20.8         9.72          18.7          9.65    -11.0 
 2 Burbank               24.0        NA             21.6         NA        NA   
 3 Lancaster-Divi…       10.4         7.52          10            7.3      -2.86
 4 Lebec                  4.82        3.50           4.8          3.4      -1.32
 5 Long Beach (No…       19.5         9.92          16.8          9.6      -9.55
 6 Los Angeles-No…       22.0        11.6           19.3         10.9     -10.4 
 7 Lynwood               23.3        NA             19.8         NA        NA   
 8 Pasadena              20.3         9.09          17.8          7.9     -11.2 
 9 Reseda                18.9        10.7           17.0         10.3      -8.14
10 Compton               NA          13.0           NA           11.9      NA   
11 Glendora              NA           8.43          NA            7.8      NA   
12 Long Beach (So…       NA          12.0           NA           11.3      NA   
13 Long Beach-Rou…       NA          13.0           NA           12.1      NA   
14 North Hollywoo…       NA          13.0           NA           13.1      NA   
15 Pico Rivera #2        NA          11.4           NA           10.0      NA   
16 Santa Clarita         NA           9.14          NA            8.5      NA   
17 Signal Hill (L…       NA           8.86          NA            8.5      NA   
# ℹ 1 more variable: diff_median <dbl>
df %>%
  filter(County == "Los Angeles" & !is.na(Local.Site.Name) & Local.Site.Name != "") %>%
  group_by(Local.Site.Name, Year) %>%
  summarise(median_pm25 = median(PM2.5, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Local.Site.Name, y = median_pm25,
             color = factor(Year), group = Local.Site.Name)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_line(aes(group = Local.Site.Name), color = "grey60") +
  labs(x = "Local Site Name", y = "Median PM2.5", color = "Year") +
  scale_color_manual(
    values = c("2002" = "red", "2022" = "blue")
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Summary: From the summary dataset of PM2.5 by local sites in Los Angeles, we can see that some site names are missing, and many sites only have data for one of the two years. This suggests that local monitoring stations may have changed locations over time. For sites with data available in both years, both the mean and median PM2.5 values generally show a decrease, indicating improved air quality. This trend can also be observed in the plot of daily median values. However, due to the limited number of sites with complete data, the evidence may not be sufficient to fully support the conclusion that air quality consistently improved across Los Angeles between 2002 and 2022.